--- Log opened Fri May 15 00:00:13 2020 |
00:28 | | celticminstrel [celticminst@Nightstar-nuu42v.dsl.bell.ca] has quit [[NS] Quit: And lo! The computer falls into a deep sleep, to awake again some other day!] |
00:32 | | celticminstrel [celticminst@Nightstar-nuu42v.dsl.bell.ca] has joined #code |
00:32 | | mode/#code [+o celticminstrel] by ChanServ |
00:45 | | Vornicus [Vorn@ServerAdministrator.Nightstar.Net] has quit [Connection closed] |
00:45 | | Vorntastic [uid293981@Nightstar-h2b233.irccloud.com] has quit [[NS] Quit: Connection closed for inactivity] |
01:36 | | celticminstrel [celticminst@Nightstar-nuu42v.dsl.bell.ca] has quit [[NS] Quit: And lo! The computer falls into a deep sleep, to awake again some other day!] |
01:40 | | celticminstrel [celticminst@Nightstar-nuu42v.dsl.bell.ca] has joined #code |
01:40 | | mode/#code [+o celticminstrel] by ChanServ |
01:40 | | celticminstrel [celticminst@Nightstar-nuu42v.dsl.bell.ca] has quit [[NS] Quit: And lo! The computer falls into a deep sleep, to awake again some other day!] |
01:41 | | celticminstrel [celticminst@Nightstar-nuu42v.dsl.bell.ca] has joined #code |
01:41 | | mode/#code [+o celticminstrel] by ChanServ |
01:42 | | Kindamoody is now known as Kindamoody|zZz] |
02:48 | | Alek [Alek@Nightstar-o723m2.cicril.sbcglobal.net] has quit [Ping timeout: 121 seconds] |
02:53 | | Alek [Alek@Nightstar-o723m2.cicril.sbcglobal.net] has joined #code |
02:53 | | mode/#code [+o Alek] by ChanServ |
04:23 | | Degi [Degi@Nightstar-hnn55p.dyn.telefonica.de] has quit [Ping timeout: 121 seconds] |
04:27 | | Degi [Degi@Nightstar-vq98oj.dyn.telefonica.de] has joined #code |
04:53 | | Vorntastic [uid293981@Nightstar-ks9.9ff.184.192.IP] has joined #code |
04:53 | | mode/#code [+qo Vorntastic Vorntastic] by ChanServ |
05:50 | | celticminstrel [celticminst@Nightstar-nuu42v.dsl.bell.ca] has quit [[NS] Quit: And lo! The computer falls into a deep sleep, to awake again some other day!] |
06:26 | | catalyst [catalyst@Nightstar-3cqj7v.dab.02.net] has joined #code |
06:28 | | catalyst_ [catalyst@Nightstar-1dili4.dab.02.net] has quit [Ping timeout: 121 seconds] |
07:42 | | Kindamoody|zZz] is now known as Kindamoody |
07:57 | <&McMartin> | Eesh. |
07:57 | <&McMartin> | One of the companies I'm interviewing with casually suggested doing some drills with codewars.com before the interview |
07:57 | <&McMartin> | This stuff is an insult |
07:58 | <&McMartin> | Most of the early stuff is ripped straight out of Exercism, and half the UI is dripping with testosterone poisoning |
08:33 | | Kindamoody is now known as Kindamoody|afk |
08:47 | | mac is now known as macdjord|slep |
09:03 | < catalyst> | I find those kinds of sites really frustrating |
09:03 | < catalyst> | I don't code that way |
09:04 | < catalyst> | the fact that companies use those kinds of things as a first instance interview technique is one of the reasons I'm scared to try and find a new job |
10:52 | <&jeroud> | Doesn't codewars.com predate exercism? |
10:53 | <&jeroud> | I may be thinking of a different thing, though. |
10:55 | <&jeroud> | Reiver: One of the standard solutions to the problem of "might crash before releasing the lock" is to use a lease instead. You write a timestamp with the RUNNING state and the next process can check if the timestamp is older than ten minutes (or whatever timeout seems reasonable) and assume the previous run crashed (or timed out) without unlocking. |
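A minimal sketch of that lease idea, assuming the caller supplies read_lock()/write_lock() helpers (both hypothetical) over whatever shared storage holds the state:

```python
import datetime as dt

LEASE = dt.timedelta(minutes=10)  # or whatever timeout seems reasonable

def try_acquire(read_lock, write_lock, now=None):
    """Return True if we may run; steal the lease if the old one looks dead."""
    now = now or dt.datetime.utcnow()
    state, stamp = read_lock()            # e.g. ("RUNNING", datetime) or ("IDLE", None)
    if state == "RUNNING" and stamp is not None and now - stamp < LEASE:
        return False                      # a live run still holds the lease
    write_lock(("RUNNING", now))          # take over; the old holder presumably crashed
    return True
```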
10:59 | <&jeroud> | Oh, having looked at the codewars website it's definitely a different thing I was thinking of. |
11:08 | <&jeroud> | Also, having spent a bunch of time in the "cloud/container infrastructure" world recently, Reiv's horrible baton system looks a whole lot better than a lot of "industry standard" things I have to deal with. |
11:08 | <&Reiver> | jeroud: That's actually what the original code did, in fact |
11:09 | <&Reiver> | It would check for RUNNING and Timestamp > 15 Minutes Old |
11:09 | <&jeroud> | Ah. Why the change? |
11:10 | <&Reiver> | Because the fucking scripting language could not be trusted to handle times properly, and would get confused as to what "Fifteen minutes ago" meant, no matter how you algebrae'd it. |
11:10 | <&jeroud> | Bleh. |
11:11 | <&Reiver> | And it then became apparent in actual use that no /genuine/ run took more than 3 minutes by standard, and the single longest exception was 7, and a crashout |
11:12 | <&Reiver> | And actual use also demonstrated we could trust the system /slightly/ more than hoped, so that a ten minute spacing was reliable enough, and also we discovered it would simply crash in a /different/ way even if there were collisions (Discovered by deliberately setting two of them loose simultaneously), so we just kinda... shrugged |
11:13 | <&Reiver> | (Two simultaneous: One would run, release the file, grab the next, and the next one after would pick up the just-run file and process it too... on the same parameters, so thus producing the same output, because the file hadn't changed yet. Until the first one hit a Really Big File, then the second one would hit a permissions lock, crash out, and ... welp?) |
11:13 | <&jeroud> | I wish I could do that with my current horrible problem. |
11:13 | | * Reiver prefers not to run systems based on Scripts Crashing Out, hence the "If you spot it previously running, hardcode it back to idle, go to sleep and try again in 5 minutes" |
11:14 | <&Reiver> | Thus buying 10 minute separation between Suspected Crash and Next Attempt |
11:14 | <&Reiver> | But it was something I was comfortable dealing with when I knew that the scripts couldn't actually /break/ on collisions, either. |
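A sketch of Reiver's reset-and-retry scheme as described above; read_state/write_state/do_work are hypothetical hooks into whatever holds the flag, and the function is assumed to be invoked on a fixed schedule:

```python
def maybe_run(read_state, write_state, do_work):
    # Called every few minutes. A leftover RUNNING flag is taken as a crashed
    # previous run: hardcode it back to idle and skip this cycle, which leaves a
    # gap between the suspected crash and the next real attempt.
    if read_state() == "RUNNING":
        write_state("IDLE")
        return
    write_state("RUNNING")
    try:
        do_work()
    finally:
        write_state("IDLE")
```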
11:16 | <&jeroud> | I have a Very Big Database. It contains all messages and metadata from about a dozen WHO COVID-19 WhatsApp services. |
11:16 | <&Reiver> | jinkies |
11:16 | <&Reiver> | (Also: Props to you) |
11:16 | <&jeroud> | It grows at a rate of about a million messages a day. |
11:16 | <&Reiver> | (Though for once in my life I can say I was Doing Useful Things Too; as I helped hammer together systems for our own COVID response.) |
11:17 | <&jeroud> | We cannot do analytics or reporting on it, because Big. |
11:17 | <&Reiver> | (... even if they were only online dog registration forms, to allow people to register doggies without coming on-site. It counts!) |
11:17 | <&jeroud> | So we have to push all the data to a different thing for that. |
11:17 | <&Reiver> | That does sound like something that would need aggressive management, yes |
11:18 | <&jeroud> | The correct way to do this would be for the app to shove events into a queue. |
11:19 | <&jeroud> | Except the app team are both very busy working on somewhat more immediate problems. |
11:19 | <&jeroud> | (Yes, both. The app team is two people.) |
11:20 | <&jeroud> | Since this is in AWS, we need to get the data from RDS (managed postgres, basically) to redshift (looks like postgres, but designed more for analytics than operations). |
11:21 | <&Reiver> | ... and you need to do this /while/ the firehose of information comes flooding in. |
11:21 | <&Reiver> | yey |
11:21 | <&jeroud> | The standard way to do this is with the aptly-named Database Migration Service. |
11:22 | <&jeroud> | Which *can* do streaming replication stuff, but only if you flip scary switches on your primary db. Which we aren't willing to do. |
11:24 | <&jeroud> | So instead I wrote some code to manage DMS tasks (which is a nightmare story for another time) and do periodic migrations of data filtered by the inserted_at column that helpfully exists in all tables. |
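A from-memory sketch of the kind of DMS table-mapping selection rule such a periodic task would build, filtering on inserted_at; the schema and table names are assumptions, and the exact key names are worth checking against the DMS documentation rather than trusting this:

```python
import json

def selection_rule(table: str, start: str, end: str) -> dict:
    # One "selection" rule per table, with a source filter restricting rows to
    # the current migration window on inserted_at.
    return {
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": f"{table}-window",
        "object-locator": {"schema-name": "public", "table-name": table},
        "rule-action": "include",
        "filters": [{
            "filter-type": "source",
            "column-name": "inserted_at",
            "filter-conditions": [{
                "filter-operator": "between",
                "start-value": start,
                "end-value": end,
            }],
        }],
    }

print(json.dumps({"rules": [selection_rule("messages", "2020-05-15 10:00:00", "2020-05-15 11:00:00")]}, indent=2))
```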
11:25 | <&Reiver> | Why are the switches scary, in this case? |
11:25 | <&jeroud> | (Most of the tables are small enough that we can just dump and replace them, but several need batch updates instead.) |
11:26 | <&jeroud> | The switch is turning on "logical replication" which, unlike the usual postgres streaming replication, writes a description of each update operation as it happens. |
11:26 | <&jeroud> | If we'd started with this, it wouldn't be scary. |
11:27 | <&jeroud> | But I only realised that our analytics wouldn't work on a normal replica a few days in when hourly queries were already taking ten minutes or more to run. |
11:29 | <&jeroud> | At which point making a potentially significant operational change to the primary db without much in the way of docs around what the impact would be seemed unwise. |
11:29 | <&jeroud> | (If we could do that from a replica instead of the primary, it wouldn't have been an issue.) |
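For context, the scary switch here is postgres's wal_level = logical (exposed on RDS as the rds.logical_replication parameter-group flag, a static parameter that only takes effect after a reboot). A small sketch of checking the current setting, with a placeholder connection string:

```python
import psycopg2  # assumed driver; any postgres client works

# Placeholder DSN for the source database.
conn = psycopg2.connect("host=primary.example.internal dbname=app user=readonly")
with conn, conn.cursor() as cur:
    cur.execute("SHOW wal_level;")
    (wal_level,) = cur.fetchone()
    # 'replica' is the usual streaming-replication setting; change-description
    # (logical) replication needs 'logical'.
    print(f"wal_level = {wal_level}")
```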
11:30 | <&jeroud> | Anyway, hourly DMS runs to migrate the data looked like a tolerable workaround. |
11:30 | <&jeroud> | And it was fine for a while. |
11:31 | <&Reiver> | Problem is, your Time To A Million is measured in hours, yeah |
11:31 | <&jeroud> | Until tasks started failing sporadically. |
11:31 | <&jeroud> | Hours, ha! |
11:33 | <&jeroud> | In the half a day before this thing launched officially we already had about five million interactions, which is ten million messages. |
11:33 | <&jeroud> | (It dropped off significantly once the novelty wore off, which took a couple of weeks.) |
11:34 | <&jeroud> | Anyway, we now have DMS tasks that sometimes fail. |
11:34 | <&Reiver> | painful |
11:34 | <&jeroud> | And the filters on data only look at the source db, not the destination. |
11:35 | <&jeroud> | So if we migrate data that has already been migrated, we get duplicates. |
11:35 | <&Reiver> | oof |
11:36 | <&Reiver> | ... yeah, ironically enough that's exactly what my Horrible Baton System was built to handle |
11:36 | <&jeroud> | So when something fails we need to delete any partial data before we restart. |
11:36 | <&Reiver> | "I need arbitary history and I don't trust anything at either end but also can't just copy everything every time, I will give you data until you can promise me you've seen the last lot" |
11:36 | <&Reiver> | There would be more aggressive methods for trimming, but in my scenario it was fine enough to just trim the file once it arrived |
11:37 | <&Reiver> | You're in a rough one there, I feel for you lad |
11:37 | <&jeroud> | Also, DMS filters are all inclusive. You get ">=", "<=", or "between". |
11:38 | <&jeroud> | So if you happen to have a timestamp that falls exactly on a migration boundary, you get duplicates anyway. |
11:40 | <&jeroud> | I have now spent *way* too much time building a thing that talks to both DMS and redshift, unconditionally deletes everything from the target tables that has a timestamp >= the start time, and then runs the tasks. |
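A minimal sketch of that delete step, assuming the target table and column names; redshift speaks the postgres wire protocol, so psycopg2 is assumed to work against it too:

```python
import psycopg2

TARGET_TABLES = ["messages", "events"]  # placeholder names

def clear_window(conn, start_time):
    # Unconditionally remove anything at or after the window start so that
    # re-running the tasks (with their inclusive >=/<=/between filters) cannot
    # leave duplicates behind.
    with conn, conn.cursor() as cur:
        for table in TARGET_TABLES:
            cur.execute(f"DELETE FROM {table} WHERE inserted_at >= %s;", (start_time,))
```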
11:40 | <&jeroud> | But detecting failures is... complicated. |
11:41 | <&jeroud> | So while my failure detection works so far, I'm not confident enough to rely on it to run unsupervised. |
11:42 | <&jeroud> | And if there's replication lag, we can easily get a successful run that misses the last ten minutes of data for the hour because that hasn't yet reached the replica we're migrating from. |
11:43 | <&jeroud> | So I must run this thing manually several times a day. |
11:43 | <&jeroud> | And watch for failures so I can rerun if necessary. |
11:46 | <&jeroud> | Oh, another fun fact: There are no indexes on the inserted_at columns. |
11:48 | <&Reiver> | >_< |
11:48 | <&Reiver> | Yeah of all the columns to have had indexed... |
11:48 | <&jeroud> | So the DMS tasks end up doing full table scans of up to a hundred million rows to get the hundred thousand or so that actually need to be sent. |
11:48 | <&Reiver> | Primary key, and /timestamp/ goddamnit people |
11:48 | <&Reiver> | Pity you can't manipulate pagination or the like |
11:49 | <&Reiver> | But that is dark sorcery I dare not touch myself anyway |
11:49 | <&jeroud> | To be fair, this database was never intended to be used for analytics. |
11:50 | <&Reiver> | no, but analytics follow rapidly behind any sufficiently sized data set |
11:50 | <&Reiver> | So it is generally wise to plan for them from the start >_> |
11:51 | <&jeroud> | The primary keys for the two biggest tables are autoincrement integers, so as of yesterday I include a filter on that with a hard-configured minimum. |
11:51 | <&Reiver> | Well hey, it's not a bad start |
11:51 | <&Reiver> | If you can tune it more closely, you can manipulate that further to be your baton, even if it's not time based |
11:52 | <&jeroud> | We'll have to manually update the config every few million rows, but at least the query optimizer is smart enough that it's never actually *slower*. |
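Roughly the shape of the source-side filter once the hard-configured minimum id is in play (table and column names assumed): the id predicate is what lets the planner use the primary-key index instead of scanning the unindexed inserted_at column.

```python
MIN_ID = 120_000_000  # placeholder; hard-configured and bumped by hand every few million rows

WINDOW_QUERY = """
    SELECT *
    FROM messages
    WHERE id >= %(min_id)s
      AND inserted_at BETWEEN %(start)s AND %(end)s;
"""
# e.g. cur.execute(WINDOW_QUERY, {"min_id": MIN_ID, "start": start, "end": end})
```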
11:52 | <&Reiver> | (Or if not true baton, then at least be a steadily updated cutoff) |
11:53 | <&Reiver> | Is your data recorded continuously, or are your primary keys non-unique and/or nonconsecutive? |
11:53 | <&Reiver> | (I know you said they're autoincrement, but ha ha ha I never assume the three are inexorably linked these days, boy do I have tales to tell you -_-) |
11:54 | <&jeroud> | These are postgres autoincrement primary key things. |
11:54 | <&Reiver> | I mean, as long as every time you run it you know what the new maximum is, you can min that on the next set of data, or even min+1 if it helps your cutoffs |
11:55 | <&Reiver> | Provided there is /something/ that 'bigger means later' and 'numbers are indexed', you should be able to use it as a cutoff even if it's not a timestamp |
11:55 | <&Reiver> | You only care about order of data, not actual age, after all |
11:55 | <&jeroud> | I *think* that means monotonically increasing as long as nothing ever sets a value for it on anything. |
11:57 | <&Reiver> | Then yeah, you should be able to use that as a cutoff for "Give me data after X", even if it's not "After X-oclock" |
11:57 | <&Reiver> | Just hang onto the last Biggest Value you saw, use it as your new next minimum. |
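A sketch of the high-water-mark version Reiver is suggesting, with file-based state purely for illustration:

```python
import json
from pathlib import Path

STATE_FILE = Path("migration_state.json")  # placeholder location

def next_min_id(default: int = 0) -> int:
    # The biggest id seen in the last successful run becomes this run's minimum.
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["max_id"]
    return default

def record_max_id(max_id: int) -> None:
    STATE_FILE.write_text(json.dumps({"max_id": max_id}))
```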
11:57 | <&jeroud> | We could probably look at the latest id in each batch table in the destination db and build the DMS tasks from that, but it adds more complexity than I'm comfortable with. |
11:57 | <&Reiver> | It is, in theory, only managing the one variable. |
11:58 | <&Reiver> | Well, two, I guess. |
11:58 | <&Reiver> | This Time and Last Time |
11:58 | <&Reiver> | But I also appreciate you're playing with an extremely large production system without a lot of testing support, it seems like |
11:58 | <&jeroud> | It wouldn't help with partial failures, though, because there's no ordering guarantee in the target or the migration. |
11:59 | <&Reiver> | But I will say that if you are updating the number manually, you might as well automate it, and 'the point the hunk of data gets shifted over' is the right time to grab the New Biggest Number out. |
11:59 | <&Reiver> | Ah, you could have a partial failure that dumps out 1 2 3 5 6 and lost 4 and 7 both? |
12:00 | <&jeroud> | Yeah. |
12:00 | <&jeroud> | But that shows up as failure at least, so we can delete the whole range and retry. |
12:01 | <&Reiver> | Right |
12:02 | <&Reiver> | And at least you're only taking From That Last Cut again. |
12:02 | <&jeroud> | We still need an end time, though, because we don't support updates to already-migrated data, so we leave a few minutes for things like delivery statuses to arrive. |
12:02 | <&Reiver> | That's fine, actually |
12:03 | <&jeroud> | So we'll need to filter on that anyway. |
12:03 | <&Reiver> | Grab cutoff, grab max, return row at 90th percentile between them, check its timestamp. Old enough? Push on! |
12:04 | <&Reiver> | That way you only hit the time column /after/ scanning the indexed primary keys >_> |
12:04 | <&jeroud> | Which means the only real benefit we get is reduced runtime, and that's offset by having to query the target for max ids before we start. |
12:05 | <&Reiver> | Ooooor you just ask the system what its current variable is up to |
12:05 | <&Reiver> | I don't know postgres explicitly, but I sure as heck know you can query a Sequence in Oracle directly to ask what it's up to |
12:06 | <&Reiver> | You don't even care if it's a couple rows out of date by the time you've done your math, you only needed it for a range estimation anyway |
12:08 | <&Reiver> | You can even get pointlessly clever with the math on that one, but I suspect that's getting into technowank rather than strictly useful |
12:09 | <&Reiver> | ("How long ago since we last successfully ran? Okay, so ten minutes would be what percentage? Excellent, let's call that percentile of the data minus some padding, and check the age. If it's too young, push back whatever% and try again...") |
12:11 | <&jeroud> | That's much more complexity than I'm comfortable with in a system like this, especially since the only benefit we get is slightly reduced runtime. |
12:11 | <&Reiver> | The latter bit in brackets was explicitly overkill |
12:11 | <&Reiver> | What I'm trying to save, here, is database scans |
12:12 | <&jeroud> | Also, it takes DMS a minute or two to update the tasks. |
12:12 | <&Reiver> | How much of it is needed depends entirely on what level of scaling you're trying to grapple with, but I figure noting that you /can/ get a minimum and a maximum without actually scanning the full database ahead of time may be of value to you in case it turns out every single query is a ballbreaker. :) |
12:13 | <&Reiver> | And that you /can/ even then use this to find what /should/ be a useful looking cutoff point, without /ever/ doing a full scan on the timestamp columns, is an option if full scans are causing you performance/reliability problems. |
12:13 | <&jeroud> | With filters based purely on configs (for the minimum ids) and inputs (for the time ranges) I can do that in parallel with deleting any potential duplicates from the target. |
12:15 | <&jeroud> | The full scans are only a performance issue. The most common failure seems to be a race condition outside our control. |
12:16 | <&jeroud> | The target side of the migration works by writing a file full of rows to S3 and then telling redshift to bulk-import that file. |
12:18 | <&jeroud> | Sometimes redshift asks S3 for the file before the part of S3 that redshift is talking to hears about the file from the part of S3 that DMS is talking to. |
12:18 | <&jeroud> | At least, that's what I *think* is happening. |
12:19 | <&jeroud> | So it's effectively random, although failures seem to be temporally clustered just enough that I suspect some kind of S3 load issue or something. |
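Since the failure is effectively random and transient, the usual workaround is a retry with a short backoff. A sketch with hypothetical run_task()/load_failed() helpers standing in for the DMS/redshift plumbing (and, per the earlier discussion, any partial window would be cleared before each retry):

```python
import time

MAX_ATTEMPTS = 3

def run_with_retry(task_id, run_task, load_failed):
    for attempt in range(1, MAX_ATTEMPTS + 1):
        run_task(task_id)                 # hypothetical: clear the window, then run the migration
        if not load_failed(task_id):      # hypothetical: whatever failure detection is available
            return
        time.sleep(30 * attempt)          # S3 visibility races tend to clear quickly
    raise RuntimeError(f"{task_id} still failing after {MAX_ATTEMPTS} attempts")
```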
12:20 | <&jeroud> | Without the config-based id filtering, a batch load takes 8-9 minutes. |
12:21 | <&jeroud> | With the config-based id filtering, it takes 3-4 minutes. |
12:21 | <&jeroud> | Currently. |
12:22 | <&Reiver> | gotta love them race conditions :/ |
12:22 | <&Reiver> | You're staging files, right? One location takes the output, then moves it to an input location before the next system can touch it? |
12:23 | <&Reiver> | Not always needed, ofc, but in our case 'both systems were trying to write and read the file simultaneously' was a common form of system failure -_- |
12:23 | <&jeroud> | Nope, we're gluing opaque cloud services together. |
12:23 | <&Reiver> | Yeah that was our problem too |
12:23 | <&Reiver> | Couldn't control the export or import |
12:23 | <&Reiver> | So we exported out, then shifted files over when they were done before they did the next bit. |
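A tiny sketch of that staging pattern (paths are placeholders): the exporter writes into one directory, and only completed files get renamed into the directory the importer watches, so the importer never sees a half-written file.

```python
from pathlib import Path

EXPORT_DIR = Path("/data/export")  # producer writes here
INBOX_DIR = Path("/data/inbox")    # consumer only ever reads from here

def publish(filename: str) -> None:
    # rename() is atomic when source and destination are on the same filesystem
    (EXPORT_DIR / filename).rename(INBOX_DIR / filename)
```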
12:24 | <&jeroud> | We *could* have built our own from scratch, but DMS handles all the necessary type conversions and stuff for us. |
12:24 | <&Reiver> | Yeah that's fair |
12:24 | <&Reiver> | (Gods, our system ended up such a rube goldberg machine just dealing with the limitations of every step of the process...) |
12:27 | <&jerith> | At this point, though, it would probably have taken us less time to build our own. |
12:28 | <&jerith> | But I wasn't expecting DMS to be *quite* this godsawfully terrible. |
12:31 | <&Reiver> | If you ever get suggested to use a piece of software called Attunity |
12:31 | <&Reiver> | For anything other than multi-gigabyte file transfers that for some reason can't be trusted to SFTP |
12:31 | <&Reiver> | run. |
12:31 | <&Reiver> | Run very fast indeed. |
12:31 | <&Reiver> | It does not do the mirroring they claim it does, and is responsible for a full two thirds of our rube goldberging to suit~ |
12:33 | <&jerith> | Heh. File transfers. |
12:34 | <&jerith> | We have had to explain to two government entities that FTP is a terrible idea even if it's over ipsec, FTPS is even worse, and SFTP doesn't need all the ipsec madness because it's already ssh. |
12:36 | <&jerith> | One of them has thankfully gone away. The other is now switching to HTTPS uploads (which they're calling an "API", but still). |
12:37 | <&jerith> | Except they don't trust CA-issued certs, so they want a copy of the cert we'll be using to have their system accept only that cert. |
12:38 | <&jerith> | Which means that we can't haz letsencrypt and automated renewals for this. |
12:38 | <&ToxicFrog> | Wait, how is FTPS even worse than unencrypted FTP? |
12:38 | <&jerith> | FTPS over ipsec is worse than FTP over ipsec. |
12:40 | <&jerith> | Because if your channel is encrypted already, all FTPS adds is complexity. |
12:41 | <&ToxicFrog> | Oh, I missed that you were comparing them both on the assumption that ipsec is in use |
12:44 | <&jerith> | Our current solution (which we're probably in the middle of deploying, I don't work Fridays) is to stick a self-signed cert on self-signed.project.domain and have our load balancer point that to the same backend as project.domain. They seem happy with that. |
12:49 | | celticminstrel [celticminst@Nightstar-nuu42v.dsl.bell.ca] has joined #code |
12:49 | | mode/#code [+o celticminstrel] by ChanServ |
12:56 | | catalyst_ [catalyst@Nightstar-sg536p.dab.02.net] has joined #code |
12:59 | | catalyst [catalyst@Nightstar-3cqj7v.dab.02.net] has quit [Ping timeout: 121 seconds] |
13:07 | | VirusJTG [VirusJTG@Nightstar-42s.jso.104.208.IP] has quit [Connection closed] |
13:22 | | VirusJTG [VirusJTG@Nightstar-42s.jso.104.208.IP] has joined #code |
13:22 | | mode/#code [+ao VirusJTG VirusJTG] by ChanServ |
13:36 | < catalyst_> | I've had 420 downloads |
13:36 | < catalyst_> | Nice. |
13:41 | <~Vorntastic> | Nice |
13:50 | <&ToxicFrog> | n i c e |
15:04 | | Kindamoody|afk is now known as Kindamoody |
15:14 | | Vornicus [Vorn@ServerAdministrator.Nightstar.Net] has joined #code |
15:15 | | mode/#code [+qo Vornicus Vornicus] by ChanServ |
15:57 | | bluefoxx [fuzzylombax@Nightstar-gmbj85.vs.shawcable.net] has quit [Connection closed] |
16:02 | | bluefoxx [fuzzylombax@Nightstar-gmbj85.vs.shawcable.net] has joined #code |
16:13 | | Vornicus [Vorn@ServerAdministrator.Nightstar.Net] has quit [Connection closed] |
16:49 | | Emmy [Emmy@Nightstar-9p7hb1.direct-adsl.nl] has joined #code |
17:13 | | Vorntastic [uid293981@Nightstar-ks9.9ff.184.192.IP] has quit [[NS] Quit: Connection closed for inactivity] |
19:29 | | catalyst_ [catalyst@Nightstar-sg536p.dab.02.net] has quit [Ping timeout: 121 seconds] |
20:48 | | Vornicus [Vorn@ServerAdministrator.Nightstar.Net] has joined #code |
20:48 | | mode/#code [+qo Vornicus Vornicus] by ChanServ |
22:28 | | catalyst [catalyst@Nightstar-tnqd6v.dab.02.net] has joined #code |
23:22 | | Syloq [Syloq@NetworkAdministrator.Nightstar.Net] has joined #code |
23:22 | | mode/#code [+o Syloq] by ChanServ |
23:26 | | Emmy [Emmy@Nightstar-9p7hb1.direct-adsl.nl] has quit [Ping timeout: 121 seconds] |
23:28 | | Vornotron [Vorn@ServerAdministrator.Nightstar.Net] has joined #code |
23:31 | | Vornicus [Vorn@ServerAdministrator.Nightstar.Net] has quit [Ping timeout: 121 seconds] |
23:37 | | Kindamoody is now known as Kindamoody|zZz] |
--- Log closed Sat May 16 00:00:15 2020 |