--- Log opened Fri May 15 00:00:13 2020 |
00:28 | | celticminstrel [celticminst@Nightstar-nuu42v.dsl.bell.ca] has quit [[NS] Quit: And lo! The computer falls into a deep sleep, to awake again some other day!] |
00:32 | | celticminstrel [celticminst@Nightstar-nuu42v.dsl.bell.ca] has joined #code |
00:32 | | mode/#code [+o celticminstrel] by ChanServ |
00:45 | | Vornicus [Vorn@ServerAdministrator.Nightstar.Net] has quit [Connection closed] |
00:45 | | Vorntastic [uid293981@Nightstar-h2b233.irccloud.com] has quit [[NS] Quit: Connection closed for inactivity] |
01:36 | | celticminstrel [celticminst@Nightstar-nuu42v.dsl.bell.ca] has quit [[NS] Quit: And lo! The computer falls into a deep sleep, to awake again some other day!] |
01:40 | | celticminstrel [celticminst@Nightstar-nuu42v.dsl.bell.ca] has joined #code |
01:40 | | mode/#code [+o celticminstrel] by ChanServ |
01:40 | | celticminstrel [celticminst@Nightstar-nuu42v.dsl.bell.ca] has quit [[NS] Quit: And lo! The computer falls into a deep sleep, to awake again some other day!] |
01:41 | | celticminstrel [celticminst@Nightstar-nuu42v.dsl.bell.ca] has joined #code |
01:41 | | mode/#code [+o celticminstrel] by ChanServ |
01:42 | | Kindamoody is now known as Kindamoody|zZz] |
02:48 | | Alek [Alek@Nightstar-o723m2.cicril.sbcglobal.net] has quit [Ping timeout: 121 seconds] |
02:53 | | Alek [Alek@Nightstar-o723m2.cicril.sbcglobal.net] has joined #code |
02:53 | | mode/#code [+o Alek] by ChanServ |
04:23 | | Degi [Degi@Nightstar-hnn55p.dyn.telefonica.de] has quit [Ping timeout: 121 seconds] |
04:27 | | Degi [Degi@Nightstar-vq98oj.dyn.telefonica.de] has joined #code |
04:53 | | Vorntastic [uid293981@Nightstar-ks9.9ff.184.192.IP] has joined #code |
04:53 | | mode/#code [+qo Vorntastic Vorntastic] by ChanServ |
05:50 | | celticminstrel [celticminst@Nightstar-nuu42v.dsl.bell.ca] has quit [[NS] Quit: And lo! The computer falls into a deep sleep, to awake again some other day!] |
06:26 | | catalyst [catalyst@Nightstar-3cqj7v.dab.02.net] has joined #code |
06:28 | | catalyst_ [catalyst@Nightstar-1dili4.dab.02.net] has quit [Ping timeout: 121 seconds] |
07:42 | | Kindamoody|zZz] is now known as Kindamoody |
07:57 | <&McMartin> | Eesh. |
07:57 | <&McMartin> | One of the companies I'm interviewing with casually suggested doing some drills with codewars.com before the interview |
07:57 | <&McMartin> | This stuff is an insult |
07:58 | <&McMartin> | Most of the early stuff is ripped straight out of Exercism, and half the UI is dripping with testosterone poisoning |
08:33 | | Kindamoody is now known as Kindamoody|afk |
08:47 | | mac is now known as macdjord|slep |
09:03 | < catalyst> | I find those kinds of sites really frustrating |
09:03 | < catalyst> | I don't code that way |
09:04 | < catalyst> | the fact that companies use those kinds of things as a first instance interview technique is one of the reasons I'm scared to try and find a new job |
10:52 | <&jeroud> | Doesn't codewars.com predate exercism? |
10:53 | <&jeroud> | I may be thinking of a different thing, though. |
10:55 | <&jeroud> | Reiver: One of the standard solutions to the problem of "might crash before releasing the lock" is to use a lease instead. You write a timestamp with the RUNNING state and the next process can check if the timestamp is older than ten minutes (or whatever timeout seems reasonable) and assume the previous run crashed (or timed out) without unlocking. |
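A minimal sketch of that lease idea, assuming the caller supplies read_lock()/write_lock() helpers (both hypothetical) over whatever shared storage holds the state:

```python
import datetime as dt

LEASE = dt.timedelta(minutes=10)  # or whatever timeout seems reasonable

def try_acquire(read_lock, write_lock, now=None):
    """Return True if we may run; steal the lease if the old one looks dead."""
    now = now or dt.datetime.utcnow()
    state, stamp = read_lock()            # e.g. ("RUNNING", datetime) or ("IDLE", None)
    if state == "RUNNING" and stamp is not None and now - stamp < LEASE:
        return False                      # a live run still holds the lease
    write_lock(("RUNNING", now))          # take over; the old holder presumably crashed
    return True
```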
10:59 | <&jeroud> | Oh, having looked at the codewars website it's definitely a different thing I was thinking of. |
11:08 | <&jeroud> | Also, having spent a bunch of time in the "cloud/container infrastructure" world recently, Reiv's horrible baton system looks a whole lot better than a lot of "industry standard" things I have to deal with. |
11:08 | <&Reiver> | jeroud: That's actually what the original code did, in fact |
11:09 | <&Reiver> | It would check for RUNNING and Timestamp > 15 Minutes Old |
11:09 | <&jeroud> | Ah. Why the change? |
11:10 | <&Reiver> | Because the fucking scripting language could not be trusted to handle times properly, and would get confused as to what "Fifteen minutes ago" meant, no matter how you algebrae'd it. |
11:10 | <&jeroud> | Bleh. |
11:11 | <&Reiver> | And it then became apparent in actual use that no /genuine/ run took more than 3 minutes by standard, and the single longest exception was 7, and a crashout |
11:12 | <&Reiver> | And actual use also demonstrated we could trust the system /slightly/ more than hoped, so that a ten minute spacing was reliable enough, and also we discovered it would simply crash in a /different/ way even if there were collisions (Discovered by deliberately setting two of them loose simultaneously), so we just kinda... shrugged |
11:13 | <&Reiver> | (Two simultaneous: One would run, release the file, grab the next, and the next one after would pick up the just-run file and process it too... on the same parameters, so thus producing the same output, because the file hadn't changed yet. Until the first one hit a Really Big File, then the second one would hit a permissions lock, crash out, and ... welp?) |
11:13 | <&jeroud> | I wish I could do that with my current horrible problem. |
11:13 | | * Reiver prefers not to run systems based on Scripts Crashing Out, hence the "If you spot it previously running, hardcode it back to idle, go to sleep and try again in 5 minutes" |
11:14 | <&Reiver> | Thus buying 10 minute separation between Suspected Crash and Next Attempt |
11:14 | <&Reiver> | But it was something I was comfortable dealing with when I knew that the scripts couldn't actually /break/ on collisions, either. |
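A sketch of Reiver's reset-and-retry scheme as described above; read_state/write_state/do_work are hypothetical hooks into whatever holds the flag, and the function is assumed to be invoked on a fixed schedule:

```python
def maybe_run(read_state, write_state, do_work):
    # Called every few minutes. A leftover RUNNING flag is taken as a crashed
    # previous run: hardcode it back to idle and skip this cycle, which leaves a
    # gap between the suspected crash and the next real attempt.
    if read_state() == "RUNNING":
        write_state("IDLE")
        return
    write_state("RUNNING")
    try:
        do_work()
    finally:
        write_state("IDLE")
```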
11:16 | <&jeroud> | I have a Very Big Database. It contains all messages and metadata from about a dozen WHO COVID-19 WhatsApp services. |
11:16 | <&Reiver> | jinkies |
11:16 | <&Reiver> | (Also: Props to you) |
11:16 | <&jeroud> | It grows at a rate of about a million messages a day. |
11:16 | <&Reiver> | (Though for once in my life I can say I was Doing Useful Things Too; as I helped hammer together systems for our own COVID response.) |
11:17 | <&jeroud> | We cannot do analytics or reporting on it, because Big. |
11:17 | <&Reiver> | (... even if they were only online dog registration forms, to allow people to register doggies without coming on-site. It counts!) |
11:17 | <&jeroud> | So we have to push all the data to a different thing for that. |
11:17 | <&Reiver> | That does sound like something that would need aggressive management, yes |
11:18 | <&jeroud> | The correct way to do this would be for the app to shove events into a queue. |
11:19 | <&jeroud> | Except the app team are both very busy working on somewhat more immediate problems. |
11:19 | <&jeroud> | (Yes, both. The app team is two people.) |
11:20 | <&jeroud> | Since this is in AWS, we need to get the data from RDS (managed postgres, basically) to redshift (looks like postgres, but designed more for analytics than operations). |
11:21 | <&Reiver> | ... and you need to do this /while/ the firehose of information comes flooding in. |
11:21 | <&Reiver> | yey |
11:21 | <&jeroud> | The standard way to do this is with the aptly-named Database Migration Service. |
11:22 | <&jeroud> | Which *can* do streaming replication stuff, but only if you flip scary switches on your primary db. Which we aren't willing to do. |
11:24 | <&jeroud> | So instead I wrote some code to manage DMS tasks (which is a nightmare story for another time) and do periodic migrations of data filtered by the inserted_at column that helpfully exists in all tables. |
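A from-memory sketch of the kind of DMS table-mapping selection rule such a periodic task would build, filtering on inserted_at; the schema and table names are assumptions, and the exact key names are worth checking against the DMS documentation rather than trusting this:

```python
import json

def selection_rule(table: str, start: str, end: str) -> dict:
    # One "selection" rule per table, with a source filter restricting rows to
    # the current migration window on inserted_at.
    return {
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": f"{table}-window",
        "object-locator": {"schema-name": "public", "table-name": table},
        "rule-action": "include",
        "filters": [{
            "filter-type": "source",
            "column-name": "inserted_at",
            "filter-conditions": [{
                "filter-operator": "between",
                "start-value": start,
                "end-value": end,
            }],
        }],
    }

print(json.dumps({"rules": [selection_rule("messages", "2020-05-15 10:00:00", "2020-05-15 11:00:00")]}, indent=2))
```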
11:25 | <&Reiver> | Why are the switches scary, in this case? |
11:25 | <&jeroud> | (Most of the tables are small enough that we can just dump and replace them, but several need batch updates instead.) |
11:26 | <&jeroud> | The switch is turning on "logical replication" which, unlike the usual postgres streaming replication, writes a description of each update operation as it happens. |
11:26 | <&jeroud> | If we'd started with this, it wouldn't be scary. |
11:27 | <&jeroud> | But I only realised that our analytics wouldn't work on a normal replica a few days in when hourly queries were already taking ten minutes or more to run. |
11:29 | <&jeroud> | At which point making a potentially significant operational change to the primary db without much in the way of docs around what the impact would be seemed unwise. |
11:29 | <&jeroud> | (If we could do that from a replica instead of the primary, it wouldn't have been an issue.) |
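For context, the scary switch here is postgres's wal_level = logical (exposed on RDS as the rds.logical_replication parameter-group flag, a static parameter that only takes effect after a reboot). A small sketch of checking the current setting, with a placeholder connection string:

```python
import psycopg2  # assumed driver; any postgres client works

# Placeholder DSN for the source database.
conn = psycopg2.connect("host=primary.example.internal dbname=app user=readonly")
with conn, conn.cursor() as cur:
    cur.execute("SHOW wal_level;")
    (wal_level,) = cur.fetchone()
    # 'replica' is the usual streaming-replication setting; change-description
    # (logical) replication needs 'logical'.
    print(f"wal_level = {wal_level}")
```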
11:30 | <&jeroud> | Anyway, hourly DMS runs to migrate the data looked like a tolerable workaround. |
11:30 | <&jeroud> | And it was fine for a while. |
11:31 | <&Reiver> | Problem is, your Time To A Million is measured in hours, yeah |
11:31 | <&jeroud> | Until tasks started failing sporadically. |
11:31 | <&jeroud> | Hours, ha! |
11:33 | <&jeroud> | In the half a day before this thing launched officially we already had about five million interactions, which is ten million messages. |
11:33 | <&jeroud> | (It dropped off significantly once the novelty wore off, which took a couple of weeks.) |
11:34 | <&jeroud> | Anyway, we now have DMS tasks that sometimes fail. |
11:34 | <&Reiver> | painful |
11:34 | <&jeroud> | And the filters on data only look at the source db, not the destination. |
11:35 | <&jeroud> | So if we migrate data that has already been migrated, we get duplicates. |
11:35 | <&Reiver> | oof |
11:36 | <&Reiver> | ... yeah, ironically enough that's exactly what my Horrible Baton System was built to handle |
11:36 | <&jeroud> | So when something fails we need to delete any partial data before we restart. |
11:36 | <&Reiver> | "I need arbitary history and I don't trust anything at either end but also can't just copy everything every time, I will give you data until you can promise me you've seen the last lot" |
11:36 | <&Reiver> | There would be more aggressive methods for trimming, but in my scenario it was fine enough to just trim the file once it arrived |
11:37 | <&Reiver> | You're in a rough one there, I feel for you lad |
11:37 | <&jeroud> | Also, DMS filters are all inclusive. You get ">=", "<=", or "between". |
11:38 | <&jeroud> | So if you happen to have a timestamp that falls exactly on a migration boundary, you get duplicates anyway. |
11:40 | <&jeroud> | I have now spent *way* too much time building a thing that talks to both DMS and redshift, unconditionally deletes everything from the target tables that has a timestamp >= the start time, and then runs the tasks. |
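A minimal sketch of that delete step, assuming the target table and column names; redshift speaks the postgres wire protocol, so psycopg2 is assumed to work against it too:

```python
import psycopg2

TARGET_TABLES = ["messages", "events"]  # placeholder names

def clear_window(conn, start_time):
    # Unconditionally remove anything at or after the window start so that
    # re-running the tasks (with their inclusive >=/<=/between filters) cannot
    # leave duplicates behind.
    with conn, conn.cursor() as cur:
        for table in TARGET_TABLES:
            cur.execute(f"DELETE FROM {table} WHERE inserted_at >= %s;", (start_time,))
```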
11:40 | <&jeroud> | But detecting failures is... complicated. |
11:41 | <&jeroud> | So while my failure detection works so far, I'm not confident enough to rely on it to run unsupervised. |
11:42 | <&jeroud> | And if there's replication lag, we can easily get a successful run that misses the last ten minutes of data for the hour because that hasn't yet reached the replica we're migrating from. |
11:43 | <&jeroud> | So I must run this thing manually several times a day. |
11:43 | <&jeroud> | And watch for failures so I can rerun if necessary. |
11:46 | <&jeroud> | Oh, another fun fact: There are no indexes on the inserted_at columns. |
11:48 | <&Reiver> | >_< |
11:48 | <&Reiver> | Yeah of all the columns to have had indexed... |
11:48 | <&jeroud> | So the DMS tasks end up doing full table scans of up to a hundred million rows to get the hundred thousand or so that actually need to be sent. |
11:48 | <&Reiver> | Primary key, and /timestamp/ goddamnit people |
11:48 | <&Reiver> | Pity you can't manipulate pagination or the like |
11:49 | <&Reiver> | But that is dark sorcery I dare not touch myself anyway |
11:49 | <&jeroud> | To be fair, this database was never intended to be used for analytics. |
11:50 | <&Reiver> | no, but analytics follow rapidly behind any sufficiently sized data set |
11:50 | <&Reiver> | So it is generally wise to plan for them from the start >_> |
11:51 | <&jeroud> | The primary keys for the two biggest tables are autoincrement integers, so as of yesterday I include a filter on that with a hard-configured minimum. |
11:51 | <&Reiver> | Well hey, it's not a bad start |
11:51 | <&Reiver> | If you can tune it more closely, you can manipulate that further to be your baton, even if it's not time based |
11:52 | <&jeroud> | We'll have to manually update the config every few million rows, but at least the query optimizer is smart enough that it's never actually *slower*. |
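Roughly the shape of the source-side filter once the hard-configured minimum id is in play (table and column names assumed): the id predicate is what lets the planner use the primary-key index instead of scanning the unindexed inserted_at column.

```python
MIN_ID = 120_000_000  # placeholder; hard-configured and bumped by hand every few million rows

WINDOW_QUERY = """
    SELECT *
    FROM messages
    WHERE id >= %(min_id)s
      AND inserted_at BETWEEN %(start)s AND %(end)s;
"""
# e.g. cur.execute(WINDOW_QUERY, {"min_id": MIN_ID, "start": start, "end": end})
```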
11:52 | <&Reiver> | (Or if not true baton, then at least be a steadily updated cutoff) |
11:53 | <&Reiver> | Is your data recorded continuously, or are your primary keys non-unique and/or nonconsecutive? |
11:53 | <&Reiver> | (I know you said they're autoincrement, but ha ha ha I never assume the three are inexorably linked these days, boy do I have tales to tell you -_-) |
11:54 | <&jeroud> | These are postgres autoincrement primary key things. |
11:54 | <&Reiver> | I mean, as long as every time you run it you know what the new maximum is, you can min that on the next set of data, or even min+1 if it helps your cutoffs |
11:55 | <&Reiver> | Provided there is /something/ that 'bigger means later' and 'numbers are indexed', you should be able to use it as a cutoff even if it's not a timestamp |
11:55 | <&Reiver> | You only care about order of data, not actual age, after all |
11:55 | <&jeroud> | I *think* that means monotonically increasing as long as nothing ever sets a value for it on anything. |
11:57 | <&Reiver> | Then yeah, you should be able to use that as a cutoff for "Give me data after X", even if it's not "After X-oclock" |
11:57 | <&Reiver> | Just hang onto the last Biggest Value you saw, use it as your new next minimum. |
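A sketch of the high-water-mark version Reiver is suggesting, with file-based state purely for illustration:

```python
import json
from pathlib import Path

STATE_FILE = Path("migration_state.json")  # placeholder location

def next_min_id(default: int = 0) -> int:
    # The biggest id seen in the last successful run becomes this run's minimum.
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["max_id"]
    return default

def record_max_id(max_id: int) -> None:
    STATE_FILE.write_text(json.dumps({"max_id": max_id}))
```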
11:57 | <&jeroud> | We could probably look at the latest id in each batch table in the destination db and build the DMS tasks from that, but it adds more complexity than I'm comfortable with. |
11:57 | <&Reiver> | It is, in theory, only managing the one variable. |
11:58 | <&Reiver> | Well, two, I guess. |
11:58 | <&Reiver> | This Time and Last Time |
11:58 | <&Reiver> | But I also appreciate you're playing with an extremely large production system without a lot of testing support, it seems like |
11:58 | <&jeroud> | It wouldn't help with partial failures, though, because there's no ordering guarantee in the target or the migration. |
11:59 | <&Reiver> | But I will say that if you are updating the number manually, you might as well automate it, and 'the point the hunk of data gets shifted over' is the right time to grab the New Biggest Number out. |
11:59 | <&Reiver> | Ah, you could have a partial failure that dumps out 1 2 3 5 6 and lost 4 and 7 both? |
12:00 | <&jeroud> | Yeah. |
12:00 | <&jeroud> | But that shows up as failure at least, so we can delete the whole range and retry. |
12:01 | <&Reiver> | Right |
12:02 | <&Reiver> | And at least you're only taking From That Last Cut again. |
12:02 | <&jeroud> | We still need an end time, though, because we don't support updates to already-migrated data, so we leave a few minutes for things like delivery statuses to arrive. |
12:02 | <&Reiver> | That's fine, actually |
12:03 | <&jeroud> | So we'll need to filter on that anyway. |
12:03 | <&Reiver> | Grab cutoff, grab max, return row at 90th percentile between them, check its timestamp. Old enough? Push on! |
12:04 | <&Reiver> | That way you only hit the time column /after/ scanning the indexed primary keys >_> |
12:04 | <&jeroud> | Which means the only real benefit we get is reduced runtime, and that's offset by having to query the target for max ids before we start. |
12:05 | <&Reiver> | Ooooor you just ask the system what its current variable is up to |
12:05 | <&Reiver> | I don't know postgres explicitly, but I sure as heck know you can query a Sequence in Oracle directly to ask what it's up to |
12:06 | <&Reiver> | You don't even care if it's a couple rows out of date by the time you've done your math, you only needed it for a range estimation anyway |
12:08 | <&Reiver> | You can even get pointlessly clever with the math on that one, but I suspect that's getting into technowank rather than strictly useful |
12:09 | <&Reiver> | ("How long ago since we last successfully ran? Okay, so ten minutes would be what percentage? Excellent, let's call that percentile of the data minus some padding, and check the age. If it's too young, push back whatever% and try again...") |
12:11 | <&jeroud> | That's much more complexity than I'm comfortable with in a system like this, especially since the only benefit we get is slightly reduced runtime. |
12:11 | <&Reiver> | The latter bit in brackets was explicitly overkill |
12:11 | <&Reiver> | What I'm trying to save, here, is database scans |
12:12 | <&jeroud> | Also, it takes DMS a minute or two to update the tasks. |
12:12 | <&Reiver> | How much of it is needed depends entirely on what level of scaling you're trying to grapple with, but I figure noting that you /can/ get a minimum and a maximum without actually scanning the full database ahead of time may be of value to you in case it turns out every single query is a ballbreaker. :) |
12:13 | <&Reiver> | And that you /can/ even then use this to find what /should/ be a useful looking cutoff point, without /ever/ doing a full scan on the timestamp columns, is an option if full scans are causing you performance/reliability problems. |
12:13 | <&jeroud> | With filters based purely on configs (for the minimum ids) and inputs (for the time ranges) I can do that in parallel with deleting any potential duplicates from the target. |
12:15 | <&jeroud> | The full scans are only a performance issue. The most common failure seems to be a race condition outside our control. |
12:16 | <&jeroud> | The target side of the migration works by writing a file full of rows to S3 and then telling redshift to bulk-import that file. |
12:18 | <&jeroud> | Sometimes redshift asks S3 for the file before the part of S3 that redshift is talking to hears about the file from the part of S3 that DMS is talking to. |
12:18 | <&jeroud> | At least, that's what I *think* is happening. |
12:19 | <&jeroud> | So it's effectively random, although failures seem to be temporally clustered just enough that I suspect some kind of S3 load issue or something. |
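Since the failure is effectively random and transient, the usual workaround is a retry with a short backoff. A sketch with hypothetical run_task()/load_failed() helpers standing in for the DMS/redshift plumbing (and, per the earlier discussion, any partial window would be cleared before each retry):

```python
import time

MAX_ATTEMPTS = 3

def run_with_retry(task_id, run_task, load_failed):
    for attempt in range(1, MAX_ATTEMPTS + 1):
        run_task(task_id)                 # hypothetical: clear the window, then run the migration
        if not load_failed(task_id):      # hypothetical: whatever failure detection is available
            return
        time.sleep(30 * attempt)          # S3 visibility races tend to clear quickly
    raise RuntimeError(f"{task_id} still failing after {MAX_ATTEMPTS} attempts")
```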
12:20 | <&jeroud> | Without the config-based id filtering, a batch load takes 8-9 minutes. |
12:21 | <&jeroud> | With the config-based id filtering, it takes 3-4 minutes. |
12:21 | <&jeroud> | Currently. |
12:22 | <&Reiver> | gotta love them race conditions :/ |
12:22 | <&Reiver> | You're staging files, right? One location takes the output, then moves it to an input location before the next system can touch it? |
12:23 | <&Reiver> | Not always needed, ofc, but in our case 'both systems were trying to write and read the file simultaneously' was a common form of system failure -_- |
12:23 | <&jeroud> | Nope, we're gluing opaque cloud services together. |
12:23 | <&Reiver> | Yeah that was our problem too |
12:23 | <&Reiver> | Couldn't control the export or import |
12:23 | <&Reiver> | So we exported out, then shifted files over when they were done before they did the next bit. |
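A tiny sketch of that staging pattern (paths are placeholders): the exporter writes into one directory, and only completed files get renamed into the directory the importer watches, so the importer never sees a half-written file.

```python
from pathlib import Path

EXPORT_DIR = Path("/data/export")  # producer writes here
INBOX_DIR = Path("/data/inbox")    # consumer only ever reads from here

def publish(filename: str) -> None:
    # rename() is atomic when source and destination are on the same filesystem
    (EXPORT_DIR / filename).rename(INBOX_DIR / filename)
```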
12:24 | <&jeroud> | We *could* have built our own from scratch, but DMS handles all the necessary type conversions and stuff for us. |
12:24 | <&Reiver> | Yeah that's fair |
12:24 | <&Reiver> | (Gods, our system ended up such a rube goldberg machine just dealing with the limitations of every step of the process...) |
12:27 | <&jerith> | At this point, though, it would probably have taken us less time to build our own. |
12:28 | <&jerith> | But I wasn't expecting DMS to be *quite* this godsawfully terrible. |
12:31 | <&Reiver> | If you ever get suggested to use a piece of software called Attunity |
12:31 | <&Reiver> | For anything other than multi-gigabyte file transfers that for some reason can't be trusted to SFTP |
12:31 | <&Reiver> | run. |
12:31 | <&Reiver> | Run very fast indeed. |
12:31 | <&Reiver> | It does not do the mirroring they claim it does, and is responsible for a full two thirds of our rube goldberging to suit~ |
12:33 | <&jerith> | Heh. File transfers. |
12:34 | <&jerith> | We have had to explain to two government entities that FTP is a terrible idea even if it's over ipsec, FTPS is even worse, and SFTP doesn't need all the ipsec madness because it's already ssh. |
12:36 | <&jerith> | One of them has thankfully gone away. The other is now switching to HTTPS uploads (which they're calling an "API", but still). |
12:37 | <&jerith> | Except they don't trust CA-issued certs, so they want a copy of the cert we'll be using to have their system accept only that cert. |
12:38 | <&jerith> | Which means that we can't haz letsencrypt and automated renewals for this. |
12:38 | <&ToxicFrog> | Wait, how is FTPS even worse than unencrypted FTP? |
12:38 | <&jerith> | FTPS over ipsec is worse than FTP over ipsec. |
12:40 | <&jerith> | Because if your channel is encrypted already, all FTPS adds is complexity. |
12:41 | <&ToxicFrog> | Oh, I missed that you were comparing them both on the assumption that ipsec is in use |
12:44 | <&jerith> | Our current solution (which we're probably in the middle of deploying, I don't work Fridays) is to stick a self-signed cert on self-signed.project.domain and have our load balancer point that to the same backend as project.domain. They seem happy with that. |
12:49 | | celticminstrel [celticminst@Nightstar-nuu42v.dsl.bell.ca] has joined #code |
12:49 | | mode/#code [+o celticminstrel] by ChanServ |
12:56 | | catalyst_ [catalyst@Nightstar-sg536p.dab.02.net] has joined #code |
12:59 | | catalyst [catalyst@Nightstar-3cqj7v.dab.02.net] has quit [Ping timeout: 121 seconds] |
13:07 | | VirusJTG [VirusJTG@Nightstar-42s.jso.104.208.IP] has quit [Connection closed] |
13:22 | | VirusJTG [VirusJTG@Nightstar-42s.jso.104.208.IP] has joined #code |
13:22 | | mode/#code [+ao VirusJTG VirusJTG] by ChanServ |
13:36 | < catalyst_> | I've had 420 downloads |
13:36 | < catalyst_> | Nice. |
13:41 | <~Vorntastic> | Nice |
13:50 | <&ToxicFrog> | n i c e |
15:04 | | Kindamoody|afk is now known as Kindamoody |
15:14 | | Vornicus [Vorn@ServerAdministrator.Nightstar.Net] has joined #code |
15:15 | | mode/#code [+qo Vornicus Vornicus] by ChanServ |
15:57 | | bluefoxx [fuzzylombax@Nightstar-gmbj85.vs.shawcable.net] has quit [Connection closed] |
16:02 | | bluefoxx [fuzzylombax@Nightstar-gmbj85.vs.shawcable.net] has joined #code |
16:13 | | Vornicus [Vorn@ServerAdministrator.Nightstar.Net] has quit [Connection closed] |
16:49 | | Emmy [Emmy@Nightstar-9p7hb1.direct-adsl.nl] has joined #code |
17:13 | | Vorntastic [uid293981@Nightstar-ks9.9ff.184.192.IP] has quit [[NS] Quit: Connection closed for inactivity] |
19:29 | | catalyst_ [catalyst@Nightstar-sg536p.dab.02.net] has quit [Ping timeout: 121 seconds] |
20:48 | | Vornicus [Vorn@ServerAdministrator.Nightstar.Net] has joined #code |
20:48 | | mode/#code [+qo Vornicus Vornicus] by ChanServ |
22:28 | | catalyst [catalyst@Nightstar-tnqd6v.dab.02.net] has joined #code |
23:22 | | Syloq [Syloq@NetworkAdministrator.Nightstar.Net] has joined #code |
23:22 | | mode/#code [+o Syloq] by ChanServ |
23:26 | | Emmy [Emmy@Nightstar-9p7hb1.direct-adsl.nl] has quit [Ping timeout: 121 seconds] |
23:28 | | Vornotron [Vorn@ServerAdministrator.Nightstar.Net] has joined #code |
23:31 | | Vornicus [Vorn@ServerAdministrator.Nightstar.Net] has quit [Ping timeout: 121 seconds] |
23:37 | | Kindamoody is now known as Kindamoody|zZz] |
--- Log closed Sat May 16 00:00:15 2020 |